In [1]:
from __future__ import print_function, division
In [2]:
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.style import use
import numpy as np
import matplotlib as mpl
%matplotlib inline
use("ggplot")
PRSA = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv",
index_col=0)
bike_sharing = pd.read_csv("day.csv", index_col=0)
In [3]:
PRSA.head()
Out[3]:
In [4]:
plt.plot( PRSA.TEMP, PRSA["pm2.5"], 'o', color="steelblue", alpha=0.5)
plt.ylabel("$\mu g/m^3$")
plt.title("PM 2.5 readings as a function of temperature (Celsius)")
Out[4]:
As one can see, there's not much one can say about the structure of the data because every point below 400 $\mu g/m^3$ is totally filled up with blue color.
It is here, that a density plot helps mitigate this problem. The central idea is that individual data points are not so important in as much as they contribute to revealing the underlying distribution of the data. In other words, for large amounts of data, we want to visualize the distribution instead of visualizing how individual datapoints are placed.
For this lesson, we will look at this data set and others to investigate the use of other plotting functions in matplotlib
.
Histograms are created using the hist
command. We illustrate this using the PRSA data set. Let's say that we are interested in plotting the distribution of DEWP
.
In [ ]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_title("Histogram of dew point readings")
ax.set_xlabel("Dew point (C$^\circ$)")
# Enter plotting code below here: ax.hist(PRSA.DEWP)
So how was this produced?
plt.figure
. fig.add_subplot
. We passed an argument 111
to the function which is a short hand for creating only one axes.set_title
function and also label the x -axis with its label. hist
method on the axes instance. We merely need to pass the (1-dimensional) array of data to the function. Notice the output that is produced. It's not very nice and probably needs some prettying up. But as a quick exploratory plot, it does its job. Notice the extra textual output. These can be suppressed with the ;
written at the end of the last statement in the code block.
The chart above can be customized to our liking by passing keyword arguments to the function.
Number of bins. The number of bins may be adjusted with the bins=
keyword argument. However, do note that more bins does not translate into a better chart. With more bins, one tends to pick up much more variation between data (noise) that what is necessary. Therefore, try to choose a value which will give you the best sense of how the data is dristributed between the extremes of no details (small bins
value) to a noisy chart (high bins
value).
Normalization. Setting normed=True
(it is False
by default) means that the total area of the histogram is set to 1. This setting is useful to compare distributions of variables on different orders of magnitude.
Is a log scale needed?. log=True
may be used if you want the count (or relative counts) to be plotted on a log scale. This may be useful if the counts in different bins differ by huge orders of magnitude. This is especially true for data modelled by to power law distributions.
Cumulative sums. Sometimes, you want to plot the ogive instead. Enable this by setting cumulative=True
.
Colors. Selecting the best color, especially when comparing multiple histograms on the same axis is crucial. You have choose different colors using color=
argument. You may use any hexadecimal color codes (e.g. ##660000
) or CSS color names, matplotlib color abbrevations (e.g. c
for cyan
, m
for magenta, b
for blue etc..)
Lines between bars. A histogram is more presentable is one draws lines between bars. Enable this by setting lw=
to an appropraite thickness (any value around 0.5 is ok) and giving it a color by setting ec=
.
Transparency control. This is useful if there are multiple histograms. Use alpha=
and enter any value from 0 (fully transparent) to 1 (fully opaque).
In [ ]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_title("Histogram of dew point readings")
ax.set_xlabel("Dew point (C$^\circ$)")
# Try typing in ax.hist(PRSA.DEWP, ...) with various customization options.
More options can be found of the documentation page.
The hist
function is not only used to plot histograms. In essence, it is a function used to plot rectangular patches on an axis. Thus, we use hist
to plot bar charts. In fact, we may use it to plot stacked bar charts, which is something seaborn
cannot do.
In the following dataset, we want to plot the distribution of daily bike rental counts (variable cnt
) on any given day of the week and seperate them by the variable weathersit
which is an ordinal variable denoting the severity of the weather. 1 means good weather, 4 means bad.
In [7]:
bike_sharing.head()
Out[7]:
Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
As you can see in the dataset above, we need to plot weekday
on the x-axis and have cnt
as the y-axis. How can we plot this using hist
? The problem is that if we pass the weekday
array to hist
, we end up counting the frequency of each day in the dataset!
To solve this, we pass the cnt
variable as a seperate parameter to hist
through the keyword argument weights
.
In [28]:
# Please run this cell before proceeding
group = bike_sharing.groupby("weathersit")
weathers = [w_r for w_r, _ in group]
day_data = [group.get_group(weather).weekday for weather in weathers]
weights_data = [group.get_group(weather).cnt for weather in weathers]
What we need to do is to group the data set by weather rating and create a list of array data to be passed to hist
. We can do this efficiently using the groupby
method on data frames and list comprehension statements. Now we have three lists: One for weather rating, one for the day of the week and one more for the daily bike rental counts.
We pass this to the hist
function and pass a sequence of floats [-0.5, 0.5,..., 6.5]
so that each bar is nicely centered on the tick mark. We also plot this count on a log scale (set log=True
) because we expect quite a large difference between bike rental counts in good weather as compared to bad. Without this, it is very difficult to see any variation for rentals during bad weather.
In [ ]:
fig2 = plt.figure()
ax2 = fig2.add_subplot(111)
ax2.set_title("Distribution of daily bike rental counts by weather conditions", y=1.05)
ax2.set_xlabel("Day")
ax2.set_ylabel("Bike rental counts")
ax2.hist( , # day_data
label= , # weathers
weights= , # weights_data
log= , # We want the y-axis to be on a log scale
bins=np.linspace(-0.5,6.5,8),
histtype="bar", # this is the default setting.
ec="k", lw=1.0,
)
ax2.legend(loc="lower right", title="Weather\nrating", fontsize="x-small");
As expected, bike rentals are low in bad weather. However, notice the variation within a week for the bad weather category. It is quite clear that people do not go biking on a bad weather Sunday since they have a choice not to go out!
Let's rerun the cell above by replacing the ax.hist
function with the following code snippet. This helps us create a stacked bar chart.
ax2.set_ylim(*(0,5e5))
ax2.hist(day_data,
label=weathers,
weights=weights_data,
bins=np.linspace(-0.5, 6.5, 8),
ec="k", lw=1.,
histtype="barstacked",
rwidth=0.8
)
A note on this new code: The histtype="barstacked"
stacks the data on top each other. The new parameter rwidth=
sets the ratio between the bar width and the bin width, which is the way you put spaces between bars. We disable log
scale so that the natural totals are more clearly seen. Notice the extreme difference between rentals in different weather conditions.
There is hardly any differences between work days and weekends although we can detect a slight increase through the week. People do love the outdoors!
To summarize the teaching points above:
hist
will do it for you. color
, lw
, ec
, alpha
, etc...normed
. histtype=
to either bar
or barstacked
. weights=
parameter if you have a summarized bin count already. hist2d
and the hexbin
plotWe use hist2d
to visualize the joint distribution of bivariate data. When we expect correlation between two variables, a two dimensional histogram helps us reveal the structure of that relationship and avoids overplotting.
Let's return to the PRSA data set. Recall that a scatterplot suffers from overplotting. In order to circumvent this and still get useful insight into the data, we use hist2d
.
In [ ]:
PRSA = PRSA.dropna() #drop missing data
fig3, ax3 = plt.subplots(figsize=(8,6))
ax3.grid(b="off")
ax3.set_facecolor("white")
ax3.set_title("PM2.5 readings distribution by temperature")
ax3.set_ylabel("$\mu g$/$m^3$")
ax3.set_xlabel("Daily temperature (C$^\circ$)")
ax3.hist2d(PRSA.TEMP, PRSA["pm2.5"],
bins=55,
cmin=5,
range=[[-15,40],[0,600]],
cmap="Blues");
This chart is more informative that a simple scatterplot. For one, we now know that there are two modes in the distribution of temperature and pollutant. Furthermore, there is more variation in pollution levels in the colder seasons as compared to warmer days.
Let's try to understand how this plot was created.
fig3, ax3 = plt.subplots(figsize=(8,6))
We initialize a figure object of width=8 units and height=6 units and an axis object to contain our histogram.
ax3.grid(b="off")
ax3.set_facecolor("white")
ax3.set_title("PM2.5 readings distribution by temperature")
ax3.set_ylabel("$\mu g$/$m^3$")
ax3.set_xlabel("Daily temperature (C$^\circ$)")
These codes set the axis grid to invisible and the background color to white. The rest are plotting information: plot titles and the units used on each axes.
ax3.hist2d(PRSA.TEMP, PRSA["pm2.5"],
bins=55,
cmin=5,
range=[[-15,40],[0,600]],
cmap="Blues");
Finally, this is the command to plot the histogram. Since this is a 2d histogram, we pass two arrays of data, first for the x-axis and then for the y-axis.
bins=
parameter is set to 55. That means there are 55 bins in each dimension. Again, too high a number leads to overfitting and creates a very noisy chart. So choose a suitable number so as not to lose too much information. cmin=
controls which frequency counts are displayed. If a particular frequency count is below cmin
, it is not plotted. range
parameter sets the expected range of the data in each dimension. Data points which are outside the range are considered outliers and not tallied. cmap
controls the color scheme used to indicate frequency counts. There are many color schemes to choose from and all can be seen here. "Blues"
is a type of color scheme known as a sequential color scheme. Use this for measures like frequency counts where the contrast between min and max is important. This plot is still not perfect. For example, it would be nice to have an way to tell the frequency counts at each color level. To do this, we add a color bar to the plot.
Modify ax3.hist(PRSA.TEMP, ...
to the following:
img = ax3.hist(PRSA.TEMP, ...
This saves the 2d histogram image(along with other supplementary information) in the variable img
. Let's see what img
contains.
In [ ]:
img
If you observe, img
is a tuple of length 4. The last entry of the tuple is the image data. Next, after the last line of ax3.hist
command, add in the function
cbar = plt.colorbar(img[3], ax=ax3)
This means that we are now going to plot a color bar in ax3
(that's what ax=ax3
means) using the image data from our 2d histogram (which is why we must pass img[3]
as an argument to plt.colorbar
. We save the created colorbar instance in a variable named cbar
so that we may further customize it.
To add a title to the colorbar, add in the following line:
cbar.set_label("Frequency")
This is what you should see if everything is done correctly.
In [11]:
In [ ]:
fig4, ax4 = plt.subplots(figsize=(9,6))
ax4.grid(b="off")
ax4.set_facecolor("white")
ax4.set_title("PM2.5 distribution by temperature")
ax4.set_ylabel("$\mu$g/$m^3$")
ax4.set_xlabel("Temperature (C$^\circ$)")
img = ax4.hexbin(PRSA.TEMP, PRSA["pm2.5"],
gridsize=55,
mincnt=5,
# bins="log",
cmap="Blues",
)
cbar = plt.colorbar(img, ax=ax4)
#cbar.set_label("$\log($Frequency)")
cbar.set_label("Frequency")
However, hexbin plots differ from the square 2d histograms in more ways than the type of tiling used. We may use hexbin plots to investigate how a dependant variable depends on 2 independant variables. Just as we passed bin frequencies to the weights
parameter in hist
, we pass the third dependant variable to the C
parameter in hexbin
. That means, we can visualize a two dimensional surface embedded in a 3D space as an altitude map.
hexbin
C
parameterLet's investigate how PM2.5 pollutants vary with temperature and atmospheric pressure in the PRSA dataset. To do that type in (or copy paste) the following code
fig5, ax5 = plt.subplots(figsize=(8,6))
ax5.grid(b="off")
ax5.set_facecolor("white")
ax5.set_title("PM2.5 pollutants as a function of temperature\nand atmospheric pressure")
ax5.set_xlabel("Temperature (C$^\circ$)")
ax5.set_ylabel("Pressure (hPa)")
img = ax5.hexbin(PRSA.TEMP, PRSA.PRES, C=PRSA["pm2.5"],
gridsize=(30, 20),
cmap="rainbow")
cbar = plt.colorbar(img, ax=ax5)
cbar.set_label("$\mu$g/m$^3$")
In [ ]:
# Paste or type in your code here
The color for each hexagon is determined by the mean value of each PM2.5 readings corresponding to the pressure and temperature readings contained in each hexagon.
However, the aggregation function on each hexagon can be changed by specifying another function to argument reduce_C_function
. Let's change this by passing the following code to hexbin
.
reduce_C_function=np.median
Exercise: Change the label of the color bar to indicate that we are taking the median PM2.5 readings in each hexagon.
In [16]:
fig5, ax5 = plt.subplots(figsize=(8,6))
ax5.grid(b="off")
ax5.set_facecolor("white")
ax5.set_title("PM2.5 pollutants as a function of temperature\nand atmospheric pressure")
ax5.set_xlabel("Temperature (C$^\circ$)")
ax5.set_ylabel("Pressure (hPa)")
img = ax5.hexbin(PRSA.TEMP, PRSA.PRES, C=PRSA["pm2.5"],
gridsize=(30,20),
,
cmap="rainbow",
)
cbar = plt.colorbar(img, ax=ax5)
# Write your answer below
In [ ]:
fig5, ax5 = plt.subplots(figsize=(8,6))
ax5.grid(b="off")
ax5.set_facecolor("white")
ax5.set_title("PM2.5 pollutants as a function of temperature\nand atmospheric pressure", y=1.05)
ax5.set_xlabel("Temperature (C$^\circ$)")
ax5.set_ylabel("Pressure (hPa)")
img = ax5.hexbin(PRSA.TEMP, PRSA.PRES, C=PRSA["pm2.5"],
gridsize=(30,20),
bins="log",
cmap="rainbow"
)
cbar = plt.colorbar(img, ax=ax5)
cbar.set_label("$\log_{10}(P)$\n$P \;\mu m$/$m^3$")
By passing "log"
to the bins
parameter, we normalize the color scale so that a color corresponds to $\log(x+1)$ where $x$ is the aggregated C
value of each hexagon. As you can see above, this may help bring out fine details between lumonosity levels. Note that the data has not be normalized, only the color correspondence.
bins="log"
cbar.set_label("$\log_{10}(P)$\n$P \;\mu m$/$m^3$")
We could also pass a single int
, $n$, to the parameter so that hexbin
creates $n$ of such color levels.
import matplotlib as mpl
d_rainbow = mpl.cm.get_cmap("rainbow", 5)
ticks=[0.4, 1.2, 2.0, 2.8, 3.6]
...
bins=5
cmap=d_rainbow
...
cbar = plt.colorbar(img, ax=ax5, ticks=ticks)
cbar.set_label("Pollutant level")
cbar.set_ticklabels(["%d" % (x) for x in range(1,6)])
By passing a list of int
's, we can set the binning boundaries explicitly. Note that we are setting the lower bound of these bins.
import matplotlib as mpl
d_rainbow = mpl.cm.get_cmap("rainbow", 4)
...
bins=[0, 50, 100, 200],
cmap=d_rainbow
...
cbar = plt.colorbar(img, ax=ax5, extend="max", ticks=[1, 1.75, 2.5, 3.25])
cbar.set_label("$\mu$m/$m^3$")
cbar.set_ticklabels(["$\geq 0$", "$\geq 50$", "$\geq 100$", "$\geq 200$"])
Exercise: The orientation of the colorbar can be changed using the orientation
parameter. There are two settings: "vertical"
and "horizontal"
. Change the orientation of the color bar so that it is parallel to the $x$-axis.
Extra: The resulting figure may be out of shape because matplotlib steals some space from the parent axis to place the color bar. Reshape the figure using the parameter figsize=(..., ...)
so that the graph is more presentable.
In [ ]: